Lab 2: Statistical inference & hypothesis testing

Practice session covering topics discussed in Lecture 2

M. Chiara Mimmi, Ph.D. | Università degli Studi di Pavia

July 26, 2024

GOAL OF TODAY’S PRACTICE SESSION

Consolidate understanding of inferential statistics through R coding examples based on real biostatistics research data.



Lecture 2: topics

  • Purpose and foundations of inferential statistics
  • Getting to know the “language” of hypothesis testing
  • Hypothesis testing
    • review examples
  • A closer look at testing assumptions
    • more examples dealing with violations of assumptions

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We will also use the packages below (specifying package::function for clarity).
# Load them for this R session
library(fs)           # file/directory interactions
library(here)         # tools to find your project's files, based on the working directory
library(janitor)      # tools for examining and cleaning data
library(dplyr)        # {tidyverse} tools for manipulating and summarising tidy data 
library(forcats)      # {tidyverse} tool for handling factors
   
library(rstatix)      # Pipe-Friendly Framework for Basic Statistical Tests
library(car)          # Companion to Applied Regression
library(broom)        # Convert Statistical Objects into Tidy Tibbles 
library(multcomp)     # Simultaneous Inference in General Parametric Models 

library(ggplot2)      # {tidyverse} tools for plotting
#library(ggstatsplot) # 'ggplot2' Based Plots with Statistical Details 
library(ggpubr)       # 'ggplot2' Based Publication Ready Plots 
library(gridExtra)    # Miscellaneous Functions for "Grid" Graphics

library(hrbrthemes)   # Additional Themes, Theme Components and Utilities for 'ggplot2' 
library(viridis)      # Colorblind-Friendly Color Maps for R 
#library(ggridges)    # alternative to plot density functions 
library(ggthemes)    # Extra Themes, Scales and Geoms for 'ggplot2'
library(RColorBrewer) # ColorBrewer Palettes 
#library(kableExtra)  # Construct Complex Table with 'kable' and Pipe Syntax

Our datasets for today

For the most part, we will refer to a real clinical dataset (released under a Creative Commons license) discussed in two open-access articles:

  • Ahmad, T., Munir, A., Bhatti, S. H., Aftab, M., & Raza, M. A. (2017). Survival analysis of heart failure patients: A case study. PLOS ONE, 12(7), e0181001. https://doi.org/10.1371/journal.pone.0181001
  • Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20(1), 16. https://doi.org/10.1186/s12911-020-1023-5

Here is the link to the dataset (or download it from the workshop website)



Importing from your project folder (previously downloaded file)

Tip

Make sure to match your own folder structure!


  • The function here::here() lets me build the complete path to the file, starting from the project root
# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
heart_failure <- read.csv(file = here::here("practice", "data_input", "02_datasets",
                                      "heart_failure_clinical_records_dataset.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

INSPECTING THE “HEART FAILURE” DATASET

What are the variables and their levels of measurement?

The data, containing the medical records of 299 heart failure patients, were collected at the Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad (Punjab, Pakistan) during April–December 2015.

Table 1 from the second article (Chicco & Jurman, 2020, p. 3) offers a concise explanation of the observed variables.

Look into the dataset just loaded in the R environment

Recall some base R functions from Lab 1

# What variables are included in this dataset?
colnames(heart_failure)
 [1] "age"                      "anaemia"                 
 [3] "creatinine_phosphokinase" "diabetes"                
 [5] "ejection_fraction"        "high_blood_pressure"     
 [7] "platelets"                "serum_creatinine"        
 [9] "serum_sodium"             "sex"                     
[11] "smoking"                  "time"                    
[13] "DEATH_EVENT"             
# How many observations & variables?
nrow(heart_failure)
[1] 299
# How many rows & columns?
dim(heart_failure)
[1] 299  13

Inspect the dataframe structure (base R)

# What does the dataframe look like?
str(heart_failure)
'data.frame':   299 obs. of  13 variables:
 $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
 $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
 $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
 $ diabetes                : int  0 0 0 0 1 0 0 1 0 0 ...
 $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
 $ high_blood_pressure     : int  1 0 0 0 0 1 0 0 0 1 ...
 $ platelets               : num  265000 263358 162000 210000 327000 ...
 $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
 $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
 $ sex                     : int  1 1 1 1 0 1 1 1 0 1 ...
 $ smoking                 : int  0 0 1 0 0 1 0 1 0 1 ...
 $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
 $ DEATH_EVENT             : int  1 1 1 1 1 1 1 1 1 1 ...

Inspect the dataframe structure (skimr)

Remember the skimr function skim?

# some variables 
heart_failure %>% skimr::skim( age, DEATH_EVENT ) 

# the whole dataframe
heart_failure %>% skimr::skim() 



You try…

Run skimr::skim() on your own, either on the whole dataset or on any specific variable

  • notice there are no missing values (NAs) in any of the variables
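A quick base R cross-check of the "no missing values" claim (a minimal sketch using pre-loaded functions):

```r
# Count NAs per column: all zeros confirms no missing data
colSums(is.na(heart_failure))

# Is there any NA anywhere in the dataframe?
anyNA(heart_failure)  # FALSE
```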

Recode some variables for later ease of analysis

I may need some variables coded as factor (e.g. categorical variables for plotting), and, while I am at it, I can add clearer labels for the variables’ levels. Here, we are:

  • using tidyverse packages dplyr and forcats
  • adding new (recoded) variables called oldname_f
heart_failure <-heart_failure %>% 
  dplyr::mutate(DEATH_EVENT_f = as.factor(DEATH_EVENT) %>%
                  forcats::fct_recode("died" = "1", "survived" = "0")) %>% 
  dplyr::mutate(sex_f = as.factor(sex) %>%
                  forcats::fct_recode("male" = "1", "female" = "0"))

# check 
table(heart_failure$DEATH_EVENT_f)

survived     died 
     203       96 
table(heart_failure$sex_f)

female   male 
   105    194 

Some more dummy variables recoded as factor

[Mostly for illustration: it’s totally fine (if not preferable) to keep these as binary (0/1) variables]

  • It’s worth learning the useful function dplyr::across, which allows us to transform several columns at once!
# Recode as factor with levels "yes" (= 1), "no" (= 0)
fct_cols = c("anaemia", "diabetes", "high_blood_pressure", "smoking" )

heart_failure <- heart_failure  %>% 
  ## ---- 1st create new cols as "factor versions" of old cols
  dplyr::mutate(
    # let's introduce `across` function 
    dplyr::across(
      # Columns to transform
      .cols = all_of(fct_cols), 
      # Functions to apply to each col  
      .fns =  ~as.factor (.x),
      # new name to apply where "{.col}" stands for the selected column
      .names = "{.col}_f")) %>% 
  ## ---- 2nd recode the levels of the new factor columns
  dplyr::mutate(
    dplyr::across(
      # Columns to transform (selected with 2 conditions)
      .cols = ends_with("_f") & !matches(c( "DEATH_EVENT_f", "sex_f" )) , 
      # Functions to apply to each col(different syntax)
      .fns = ~forcats::fct_recode(.x,  yes = "1", no = "0" )))

(Small digression on dplyr::across)

Notice how dplyr::across(.cols = ..., .fns = ..., .names = ...) has these arguments:

  1. .cols = to select the columns which we want to transform (i.e. fct_cols)
  2. .fns = ~function(.x) to specify the function
    • where ~function(.x) uses the “anonymous function” syntax of the tidyverse
    • and .x inside the function is a “stand in” for each of the columns selected
  3. [optional] .names = to name the new cols created using {.col} in place of each of the transformed columns
## ---- 1st create new cols as "factor versions" of old cols
heart_failure <- heart_failure  %>% 
  dplyr::mutate(
    dplyr::across(
      .cols = all_of(fct_cols), 
      .fns = ~as.factor (.x), 
      # (optional)
      .names = "{.col}_f")) %>% 
  ## ---- 2nd recode the levels of the new factor columns
  dplyr::mutate(
    dplyr::across(
      .cols = ends_with("_f") & !matches(c( "DEATH_EVENT_f", "sex_f" )) , 
      .fns =  ~forcats::fct_recode(.x,  yes = "1", no = "0" )))

VISUAL DATA EXPLORATION FOR THE “HEART FAILURE”

CONTINUOUS VARIABLES

Why is visual exploration important?

  • Gaining insight on the variables (range, outliers, missing data)
  • Preliminary check of assumptions for parametric hypothesis testing:
    • normally distributed outcome variables?
    • homogeneity of variance across groups?
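Besides plots, the normality assumption can also be checked quickly with base stats functions; a minimal sketch using age as an example (formal assumption testing is revisited later in the session):

```r
# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(heart_failure$age)
qqline(heart_failure$age)

# Shapiro-Wilk test (H0: the variable is normally distributed)
shapiro.test(heart_failure$age)
```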

Let’s explore the Heart failure dataset with some data visualization…

  • Following the referenced articles (which were mostly interested in predicting mortality from patients’ characteristics), we will take the categorical, binary variable DEATH_EVENT_f as our main criterion to split the sample (into survived and died patients). We will then explore whether there are significant differences between the groups in the means of known quantitative features.
  • We will look at both:
    • continuous variables in the dataset (with the Probability Density Function (PDF))
    • discrete variables in the dataset (with the Probability Mass Function (PMF))

Age

Introducing the handy R package patchwork which lets us compose different plots in a very simple and intuitive way

  • (check it out with ??patchwork)
age <-ggplot(heart_failure,aes(x = age ))+
  geom_histogram(binwidth = 5, color = "white", fill = "grey",alpha = 0.5)+
  geom_vline(aes(xintercept = mean(age)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  labs(title = "Age Distribution" )+
  scale_x_continuous(breaks = seq(40,100,5))  

age2 <-ggplot(heart_failure, aes(x = age, fill = DEATH_EVENT_f))+
  geom_histogram(binwidth = 5, position = "identity",alpha = 0.5,color = "white")+
  geom_vline(aes(xintercept = mean(age[DEATH_EVENT == 0])), color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(age[DEATH_EVENT==1])), color = "#d8717b")+
  theme_fivethirtyeight()+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  labs(title =  "Age Distribution by group (Death Event)")+
  scale_x_continuous(breaks = seq(40,100,5))

# patchwork
library(patchwork)
age + age2 + plot_layout(ncol = 1)

Age

As age increases, the incidence of the death event seems to increase

Creatinine Phosphokinase (CPK)

cpk <- ggplot(heart_failure,aes(x = creatinine_phosphokinase))+
  geom_density(fill = "gray", alpha = 0.5)+
  scale_x_continuous(breaks = seq(0,8000, 500))+
  geom_vline(aes(xintercept = mean(creatinine_phosphokinase)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  theme(axis.text.x = element_text(angle=50, vjust=0.75))+
  labs(title = "Creatinine phosphokinase (density distribution)" )+
  theme(plot.caption = element_text(hjust = 0.5, face = "italic"))

cpk2 <- ggplot(heart_failure,aes(x = creatinine_phosphokinase,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+theme_fivethirtyeight()+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  scale_x_continuous(breaks = seq(0,8000, 500))+
  geom_vline(aes(xintercept = mean(creatinine_phosphokinase[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(creatinine_phosphokinase[DEATH_EVENT==1])), 
             color = "#d8717b")+
  theme_fivethirtyeight()+
  theme(axis.text.x = element_text(angle=50, vjust=0.75))+
  labs(title =  "Creatinine phosphokinase (density distribution) by group (Death Event)")

cpk + cpk2 + plot_layout(ncol = 1)

Creatinine Phosphokinase (CPK)

This definitely doesn’t look like a normal distribution!
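For heavily right-skewed variables like CPK, re-plotting on a log10 axis often makes the shape easier to inspect; a minimal sketch (for visual exploration only, not used in the tests below):

```r
ggplot(heart_failure, aes(x = creatinine_phosphokinase)) +
  geom_density(fill = "gray", alpha = 0.5) +
  scale_x_log10() +  # log10 axis tames the long right tail
  labs(title = "Creatinine phosphokinase (density, log10 x-axis)")
```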

Ejection Fraction

ejf <- ggplot(heart_failure,aes(x = ejection_fraction))+
  geom_density(fill = "gray", alpha = 0.5)+
  geom_vline(aes(xintercept = mean(ejection_fraction)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  labs(title = "Ejection Fraction (density distribution)" )+
  theme(plot.caption = element_text(hjust = 0.5, face = "italic"))

ejf2 <- ggplot(heart_failure,aes(x = ejection_fraction,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+theme_fivethirtyeight()+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  geom_vline(aes(xintercept = mean(ejection_fraction[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(ejection_fraction[DEATH_EVENT==1])), 
             color = "#d8717b")+
  labs(title =  "Ejection Fraction (density distribution) by group (Death Event)")+
  theme_fivethirtyeight()

ejf + ejf2 + plot_layout(ncol = 1)

Ejection Fraction

This also doesn’t look like a normal distribution… and there is a remarkable change in the probability density function (PDF) shape when we introduce the grouping variable

Platelets

plat <- ggplot(heart_failure,aes(x = platelets))+
  geom_density(fill = "gray", alpha = 0.5)+
  geom_vline(aes(xintercept = mean(platelets)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  labs(title = "Platelets (density distribution)" )+
  theme(plot.caption = element_text(hjust = 0.5, face = "italic"))

plat2 <- ggplot(heart_failure,aes(x = platelets,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+theme_fivethirtyeight()+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  geom_vline(aes(xintercept = mean(platelets[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(platelets[DEATH_EVENT==1])), 
             color = "#d8717b")+
  labs(title =  "Platelets (density distribution) by group (Death Event)")+
  theme_fivethirtyeight()

plat + plat2 + plot_layout(ncol = 1)

Platelets

Here the probability distributions resemble a Normal one and we observe more uniformity in the mean/variance across the 2 groups

Serum Creatinine

ser_cr <- ggplot(heart_failure,aes(x = serum_creatinine))+
  geom_density(fill = "gray", alpha = 0.5)+
  geom_vline(aes(xintercept = mean(serum_creatinine)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  labs(title = "Serum Creatinine (density distribution)" )+
  theme(plot.caption = element_text(hjust = 0.5, face = "italic"))

ser_cr2 <- ggplot(heart_failure,aes(x = serum_creatinine,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+theme_fivethirtyeight()+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  geom_vline(aes(xintercept = mean(serum_creatinine[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(serum_creatinine[DEATH_EVENT==1])), 
             color = "#d8717b")+
  labs(title =  "Serum Creatinine (density distribution) by group (Death Event)")+
  theme_fivethirtyeight()

ser_cr + ser_cr2 + plot_layout(ncol = 1)

Serum Creatinine

Another continuous random variable with a non-normal distribution (long right tails) and a seemingly important difference in variance between the groups.

Serum Sodium

ser_sod <- ggplot(heart_failure,aes(x = serum_sodium))+
  geom_density(fill = "gray", alpha = 0.5)+
  theme_fivethirtyeight()+
  geom_vline(aes(xintercept = mean(serum_sodium)), color = "#4c4c4c")+
    labs(title = "Serum Sodium (density distribution)" )+
  theme(plot.caption = element_text(hjust = 0.5, face = "italic"))

ser_sod2 <- ggplot(heart_failure,aes(x = serum_sodium,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+theme_fivethirtyeight()+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  geom_vline(aes(xintercept = mean(serum_sodium[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(serum_sodium[DEATH_EVENT==1])), 
             color = "#d8717b")+
  labs(title =  "Serum Sodium (density distribution) by group (Death Event)")+
  theme_fivethirtyeight()

ser_sod + ser_sod2 + plot_layout(ncol = 1)

Serum Sodium

Same as above, except for the long left tails…

VISUAL DATA EXPLORATION FOR THE “HEART FAILURE”

DISCRETE VARIABLES

Anaemia

anem <- ggplot(heart_failure, aes(x = forcats::fct_infreq(DEATH_EVENT_f ), 
                                  fill = anaemia_f ))+
  geom_bar(position = "dodge")+
  ## add count labels
  geom_text(stat = "count", aes(label = ..count..),
            ## make labels suit the dodged bars 
            position=position_dodge(width = 1 ), 
            hjust=0.5, vjust=2,color = "white") +
  theme_fivethirtyeight() +
  #scale_x_discrete(labels  = c("Death Event:No","Death Event:Yes"))+
  scale_fill_manual(values = c("#af854f", "#af4f78"),
                    name = "Has Anaemia",
                    labels = c("No","Yes"))+
  labs(title = "Number of Patients with Anemia") + 
  theme(#axis.text.x = element_text(angle=50, vjust=0.75), 
    axis.text.x = element_text(size=12,face="bold"))     

anem

Anaemia

There seems to be a greater incidence of anaemia in group ‘died’

Diabetes

diab <- ggplot(heart_failure, 
               aes(x = forcats::fct_infreq(DEATH_EVENT_f ), fill = diabetes_f ))+
  geom_bar(position = "dodge")+
  ## add count labels
  geom_text(stat = "count", aes(label = ..count..),
            ## make labels suit the dodged bars 
            position=position_dodge(width = 1 ), 
            hjust=0.5, vjust=2,color = "white", size =4) +
  theme_fivethirtyeight() +
  #scale_x_discrete(labels  = c("Death Event:No","Death Event:Yes"))+
  scale_fill_manual(values = c("#af854f", "#af4f78"),
                    name = "Has Diabetes",
                    labels = c("No","Yes"))+
  labs(title = "Number of Patients with Diabetes") + 
  theme(#axis.text.x = element_text(angle=50, vjust=0.75), 
    axis.text.x = element_text(size=12,face="bold"))     

diab

Diabetes

Smoking

smok <- ggplot(heart_failure, aes(x = forcats::fct_infreq(DEATH_EVENT_f ), 
                                  fill = smoking_f ))+
  geom_bar(position = "dodge")+
  ## add count labels
  geom_text(stat = "count", aes(label = ..count..),
            ## make labels suit the dodged bars 
            position=position_dodge(width = 1 ), 
            hjust=0.5, vjust=2,color = "white", size =4) +
  theme_fivethirtyeight() +
  #scale_x_discrete(labels  = c("Death Event:No","Death Event:Yes"))+
  scale_fill_manual(values = c("#af854f", "#af4f78"),
                    name = "Patient smokes",
                    labels = c("No","Yes"))+
  labs(title = "Number of Patients who smoke") + 
  theme(#axis.text.x = element_text(angle=50, vjust=0.75), 
    axis.text.x = element_text(size=12,face="bold"))     

smok

Smoking

High blood pressure

hbp <- ggplot(heart_failure, aes(x = forcats::fct_infreq(DEATH_EVENT_f ), 
                                  fill = high_blood_pressure_f ))+
  geom_bar(position = "dodge")+
    ## add count labels
  geom_text(stat = "count", aes(label = ..count..),
            ## make labels suit the dodged bars 
            position=position_dodge(width = 1 ), 
            hjust=0.5, vjust=2,color = "white", size =4) +
  theme_fivethirtyeight() +
  #scale_x_discrete(labels  = c("Death Event:No","Death Event:Yes"))+
  scale_fill_manual(values = c("#af854f", "#af4f78"),
                    name = "Has high blood pressure",
                    labels = c("No","Yes"))+
  labs(title = "Number of Patients with High blood pressure") + 
  theme(#axis.text.x = element_text(angle=50, vjust=0.75), 
    axis.text.x = element_text(size=12,face="bold"))     

hbp

High blood pressure

There is also a greater incidence of high blood pressure in group ‘died’
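The counts behind these bar charts can also be tabulated with janitor (loaded above); a minimal sketch using high blood pressure as an example:

```r
heart_failure %>% 
  janitor::tabyl(DEATH_EVENT_f, high_blood_pressure_f) %>%  # cross-tabulation
  janitor::adorn_percentages("row") %>%                     # row-wise proportions
  janitor::adorn_pct_formatting(digits = 1) %>% 
  janitor::adorn_ns()                                       # append raw counts
```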

HYPOTHESIS TESTING - some examples -

Let’s continue to explore data from the heart failure patients’ dataset, but this time using hypothesis testing as we learned in Lecture 2. We will do two types of test:

  1. Comparing a sample against a hypothetical general population
  2. Testing whether the differences in variable means between the two groups of patients (those who survived the heart failure event and those who didn’t) are statistically significant

— EXAMPLE A —

(1 sample | n > 30 | Z test)

Comparing sample mean to a hypothesized population mean (with Z test)

Stating the above hypotheses more formally:

What is the population Total Platelet Count (TPC) mean for all people who suffered heart failure (\(𝝁_{HF}\))?

  • \(𝑯_𝟎\) : there is no difference in mean TPC between patients who suffered heart failure and the general population
    • \(𝝁_{HF}\) = 236 -> hypothesis of no effect (“no difference”)
  • \(𝑯_𝒂\) : there is a difference in mean TPC between patients who have suffered heart failure and the general population (“some effect”). This can be formalized as either:
    • \(𝝁_{HF}\) < 236 (one-sided test), or
    • \(𝝁_{HF}\) > 236 (one-sided test), or
    • \(𝝁_{HF}\) ≠ 236 (two-sided test)

1. Question: How does the mean platelet count in the patients’ sample compare against a reference population?

# normalize the var for readability 
heart_failure  <-  heart_failure %>%  mutate(plat_norm = platelets/1000) 
# compute mean & sd for plot
mean_plat_p <- round(mean(heart_failure$plat_norm), digits = 1)
sd_plat_p <- round(sd(heart_failure$plat_norm), digits = 1)
 
heart_failure %>% 
  ggplot(aes(x = plat_norm))+
  geom_histogram(aes(y = ..density..), bins=30, alpha=0.25, colour = "#4c4c4c") + 
  geom_density(colour ="#9b2339", alpha=0.25, fill = "#9b2339") +
  # add mean vertical line
  geom_vline(xintercept = mean_plat_p, na.rm = FALSE,size = 1,color= "#9b6723") +
  # add also +/- 1sd  
  geom_vline(aes(xintercept = mean_plat_p + sd_plat_p), 
             color = "#23749b", size = 1, linetype = "dashed") +
  geom_vline(aes(xintercept = mean_plat_p - sd_plat_p), 
             color = "#23749b", size = 1, linetype = "dashed") +
  # add annotations with the mean value
  geom_label(aes(x=mean_plat_p,  y=0.0085, label=paste0("Sample mean\n",mean_plat_p)),
             color = "#9b6723") + 
  geom_label(aes(x=361,  y=0.0085, label=paste0("Sample sd\n",sd_plat_p)),
             color = "#23749b") +
  theme_bw() +  labs(y = "Density", x = "Sample platelet count (x 1000/µL)") 

1. Question: How does the mean platelet count in the patients’ sample compare against a reference population?

For the general population, the Total Platelet Count (TPC) has 𝛍 = 236 (1000 /µL) and 𝛔 = 59 (1000 /µL). Below is the sample distribution:

2.a Computation of the test statistic

In this case, we have:

  • a large sample \((n = 299 > 30)\)
  • a known \(𝛔^𝟐\) (of the reference population)
  • the observed sample mean \(\bar{x}\) and sample sd \(s\).

So we can compute:

\(𝒁_{calc}=\frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}}\)

# General Population of reference 
mu <- 236 
sigma  <- 59
# Sample of HF patients
n <- 299
x_HF <- mean(heart_failure$plat_norm)         #    263.358
s_HF <- sd(heart_failure$plat_norm)           #    97.80424
# IF large sample & KNOWN pop variance 
std_err_HF <- sigma /sqrt(n)                  # 3.412058
z_calc_HF <-  (x_HF - mu) / std_err_HF        # 8.018043

2.b Computation of the p-value associated to the test statistic

To find the p-value associated with a z-score in R, we can use the pnorm() function, which uses the following syntax:

  • q: The z-score
  • mean: The mean of the normal distribution. Default is 0.
  • sd: The standard deviation of the normal distribution. Default is 1.
  • lower.tail:
    • If TRUE, the probability to the left of q in the normal distribution is returned
    • If FALSE, the probability to the right is returned. Default is TRUE.
# Left-tailed test
p_value_l <- stats::pnorm(z_calc_HF, mean = 0, sd = 1, lower.tail = TRUE) 
# Right-tailed test
p_value_r <- stats::pnorm(z_calc_HF, mean = 0, sd = 1,lower.tail = FALSE) 
# Two-tailed test  (our case)
p_value_two <- 2*stats::pnorm(z_calc_HF, mean = 0, sd = 1, lower.tail = FALSE)  

3. Results and interpretation

  1. Based on the critical region, z_calc_HF = 8.0180 falls in the CRITICAL REGION (well beyond the critical value)
# given 
z_critical  <- 1.96 # |Z| cutoff corresponding to 𝛼 = 0.05 (two-sided)
# Check 
abs(z_calc_HF) > z_critical 
[1] TRUE
  2. Based on the p-value, p_value_two = 1.07443e-15 is much, much smaller than \(\alpha\)
# Check
p_value_two <  0.05
[1] TRUE

DECISION: we reject the Null Hypothesis (we conclude that it is extremely unlikely that we would observe such a sample mean if the null hypothesis were true). So the test indicates that, indeed, there is a difference between heart failure patients and the general population in terms of average platelet count.
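For reference, the same z-test is available as a single call in the BSDA package (not loaded above, so this is an optional sketch):

```r
# install.packages("BSDA")  # if not already installed
BSDA::z.test(heart_failure$plat_norm, mu = 236, sigma.x = 59,
             alternative = "two.sided")
```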

— EXAMPLE B —

(1 sample | n < 30 | t test)

Comparing sample mean to a hypothesized population mean (with t test)

Same question, but with a smaller sample to work on (the threshold varies, but “small” generally means \(n < 30\)). Imagine the patients were only observed over a follow-up period of 21 days, and let’s assume we don’t know the population’s variance.

Stating the hypothesis more formally:

What is the population Total Platelet Count (TPC) mean for all people who suffered heart failure and were observed within a 21-day follow-up (\(𝝁_{HF21d}\))?

  • \(𝑯_𝟎\) : there is no difference in mean TPC between patients who suffered heart failure (observed within 21 days) and the general population
    • \(𝝁_{HF21d}\) = 236 -> hypothesis of no effect (“no difference”)
  • \(𝑯_𝒂\) : there is a difference in mean TPC between patients who have suffered heart failure and the general population (“some effect”). This can be formalized as:
    • \(𝝁_{HF21d}\) ≠ 236 (two-sided test)

1. Question: How does the mean platelet count in the patients’ sample compare against a reference population?

# normalize the var for readability 
heart_21d  <-  heart_failure %>%  mutate(plat_norm = platelets/1000) %>% 
  filter(time <= 21)                                # 23 obs 
# compute mean & sd for plot
mean_plat_p <- round(mean(heart_21d$plat_norm), digits = 1)
sd_plat_p <- round(sd(heart_21d$plat_norm), digits = 1)
 
heart_21d %>% 
  ggplot(aes(x = plat_norm))+
  geom_histogram(aes(y = ..density..), bins=30, alpha=0.25, colour = "#4c4c4c") + 
  geom_density(colour ="#9b2339", alpha=0.25, fill = "#9b2339") +
  # add mean vertical line
  geom_vline(xintercept = mean_plat_p, na.rm = FALSE,size = 1,color= "#9b6723") +
  # add also +/- 1sd  
  geom_vline(aes(xintercept = mean_plat_p + sd_plat_p), 
             color = "#23749b", size = 1, linetype = "dashed") +
  geom_vline(aes(xintercept = mean_plat_p - sd_plat_p), 
             color = "#23749b", size = 1, linetype = "dashed") +
  # add annotations with the mean value
  geom_label(aes(x=mean_plat_p,  y=0.014, label=paste0("Sample mean\n",mean_plat_p)),
             color = "#9b6723") + 
  geom_label(aes(x=361,  y=0.014, label=paste0("Sample sd\n",sd_plat_p)),
             color = "#23749b") +
  theme_bw() +  labs(y = "Density", x = "Sample platelet count (x 1000/µL)") 

1. Question: How does the mean platelet count in the patients’ sample compare against a reference population?

For the general population, the Total Platelet Count (TPC) has 𝛍 = 236 (1000 /µL) and 𝛔 = 59 (1000 /µL). Below is the distribution of the smaller sample:

2.a Computation of the test statistic

In this case, we have:

  • a “small” sample \(n = 23\)
  • an unknown \(𝛔^𝟐\) (of the reference population)
  • We obtained the sample mean \(\bar{x}\) and sample sd \(s\).

So we can compute:

\(t_{calc}=\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}\), with \(df = n - 1\)

# SAMPLE: HF patients with follow-up of 21 days or less
heart_21d <- heart_failure %>% 
  filter(time <= 21)                                 # 23 obs 

n_21d <- nrow(heart_21d)                             # 23
x_HF_21d <- mean(heart_21d$plat_norm)                # 251.5094
s_HF_21d <- sd(heart_21d$plat_norm)                  # 102.7341
df_HF_21d <- n_21d - 1                               # 22   

# SMALL sample & UNKNOWN population variance:
# the standard error uses the sample size n_21d (not the full-sample n!)
std_err_HF_21d <- s_HF_21d / sqrt(n_21d)             # 21.4215
t_stat_HF_21d <- (x_HF_21d - mu) / std_err_HF_21d    # 0.724

2.b Computation of the p-value associated to the test statistic

To find the p-value associated with a t-score in R, we can use the pt(q, df, lower.tail = TRUE) function, which uses the following syntax:

  • q: The t-score
  • df: The degrees of freedom
  • lower.tail:
    • If TRUE, the probability to the left of q is returned (left-tailed test)
    • If FALSE, the probability to the right of q is returned (right-tailed test)
# -- Left-tailed test
#pt(t_stat_HF_21d, df_HF_21d, lower.tail = TRUE)

# -- Right-tailed test
#pt(t_stat_HF_21d, df_HF_21d, lower.tail = FALSE) 

# -- Two-tailed test  (our case)
p_value_t_test <- 2*pt(t_stat_HF_21d, df_HF_21d, lower.tail = FALSE) # ~ 0.477

3. Results and interpretation

  1. Based on the critical region, t_stat_HF_21d = 0.724 does NOT fall in the CRITICAL REGION (it is well below the critical value)
#find two-tailed t critical values

t_crit_two <- qt(p=.05/2, df=22, lower.tail=FALSE)    # 2.073873
# Compare t score against t critical    
t_stat_HF_21d > t_crit_two  # FALSE 
[1] FALSE
  2. Based on the p-value, p_value_t_test ≈ 0.477 is larger than \(\alpha\)
# Check 
p_value_t_test <  0.05  # FALSE 
[1] FALSE

DECISION: we fail to reject the Null Hypothesis (a sample mean this far from 236 could plausibly have occurred just by chance). With only 23 observations the standard error is much larger, so the test does not detect a difference between heart failure patients observed within 21 days and the general population in terms of average platelet count.

Note

What changed testing a sample with smaller n, instead of a large one?
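The manual computation above can be cross-checked with the built-in stats::t.test(), which returns the t statistic, degrees of freedom and two-sided p-value in one call:

```r
stats::t.test(heart_21d$plat_norm, mu = 236, alternative = "two.sided")
```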

— EXAMPLE C —

(2 samples | t test)

Comparing two independent sample means (t test)

This time, we investigate whether there is a statistically significant difference between the mean Platelet Counts of the patients who died and of the patients who survived heart failure.

Stating the above hypotheses more formally:

Is there a statistically significant difference between the mean values of two groups?

  • \(𝑯_𝟎\) : The two population means are equal
    • \(𝝁_𝟏 = 𝝁_𝟎 ⟺ 𝝁_𝟏−𝝁_𝟎=𝟎\)
  • \(𝑯_𝒂\) : There is a mean difference between the two groups in the population. Possible directional difference formulation (two-tailed, left-tailed, right-tailed)
    • \(𝝁_𝟏≠𝝁_𝟎 ⟺ 𝝁_𝟏−𝝁_𝟎≠𝟎\) (the two population means are not equal)
    • \(𝝁_𝟏 < 𝝁_𝟎 ⟺ 𝝁_𝟏−𝝁_𝟎<𝟎\) (population 1 mean is less than population 0 mean)
    • \(𝝁_𝟏 > 𝝁_𝟎 ⟺ 𝝁_𝟏−𝝁_𝟎>𝟎\) (population 1 mean is greater than population 0 mean)

Comparing two independent sample means (t test) (cont.)

1. Question: is there a statistically significant difference between the Platelet Counts in the patients who died v. survived heart failure?

# boxplot by group
heart_failure %>% 
  ggplot(mapping = aes(y = plat_norm, x = DEATH_EVENT_f, fill = DEATH_EVENT_f)) +
  geom_boxplot(alpha=0.5)+ 
  #geom_violin(alpha=0.5) +
  geom_point(position = position_jitter(width = 0.1), size = 0.5)+ 
  scale_fill_manual(values = c("#999999", "#d8717b"))  +
  # drop legend and Y-axis title
  theme(plot.title = element_text(size = 14,face="bold", color = "#873c4a"),
        legend.position = "none",
        axis.text.x = element_text(size=12,face="bold"), 
        axis.text.y = element_text(size=12,face="bold")) + 
  labs(title = "Boxplot of Total Platelet Count (TPC), grouping by DEATH_EVENT [0,1]",
       x = "", y  = "Platelet count (1000 /µL)")

Comparing two independent sample means (t test) (cont.)

There seems to be no major difference between the two groups

2. Verify the assumptions for the independent t-test

  1. The 2 samples (“died” and “survived”) must be independent ✅
  2. The dependent variable is measured on an interval/ratio scale (Platelet Count in 10^3 /µL) ✅
  3. The dependent variable is normally distributed ✅
    • (If not, use a non-parametric test)
  4. The variance within the 2 groups should be similar ❓
    • (If not, perform Welch’s t-test)

Preliminary Fisher’s F test to check for variance equality

  • We can compute the Fisher test manually
## -- data by group
n_died <- nrow(heart_failure[heart_failure$DEATH_EVENT == 1 ,])
mean_died <- mean(heart_failure [ heart_failure$DEATH_EVENT == 1,  "plat_norm"])
sd_died <- sd(heart_failure [heart_failure$DEATH_EVENT == 1 ,  "plat_norm"])
var_died <- var(heart_failure [heart_failure$DEATH_EVENT == 1 ,  "plat_norm"])

n_survived <- nrow(heart_failure[heart_failure$DEATH_EVENT == 0, ])
mean_survived <- mean(heart_failure [ heart_failure$DEATH_EVENT == 0,  "plat_norm"])
sd_survived <- sd(heart_failure [heart_failure$DEATH_EVENT == 0 ,  "plat_norm"])
var_survived <- var(heart_failure [heart_failure$DEATH_EVENT == 0 ,  "plat_norm"])

## -- F TEST
F_ratio <- var_died / var_survived
F_ratio  # 1.020497 
[1] 1.020497

Preliminary Fisher’s F test to check for variance equality (.cont)

## -- Define the critical value of F distribution for a risk of alpha = 0.05
# qf(p=.05, df1 = n_died-1, df2 = n_survived-1, lower.tail = FALSE) # RIGHT-Tailed
# qf(0.95, df1 = n_died-1, df2 = n_survived-1, lower.tail = FALSE) # LEFT- Tailed 
qf(c(0.025, 0.975), df1 = n_died-1, df2 = n_survived-1) # TWO-Tailed 
[1] 0.6994659 1.3987233
## --Compute the exact p-value (two-tailed )
p_value_f <- 2 * (1 - pf(F_ratio, df1 = (n_died-1), df2 = (n_survived-1))) 
p_value_f
[1] 0.8914982

A test statistic (F) of 1.02 is obtained, with degrees of freedom 95 and 202.

The p-value is 0.89, greater than the significance threshold \(\alpha = 0.05\), so we cannot reject the null hypothesis of equal variances.

The variance within the 2 groups should be similar ✅ –> we can run a t-test.
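The manual computation above can be cross-checked with the built-in `stats::var.test`, which performs the same F test in one call. A sketch, assuming the `heart_failure` data and the `DEATH_EVENT_f` factor used in the boxplot above:

```r
# Built-in F test of equal variances (stats::var.test)
var.test(plat_norm ~ DEATH_EVENT_f, data = heart_failure,
         alternative = "two.sided")
# The two-sided p-value should match the manual result (~0.89);
# the F statistic may be the reciprocal of F_ratio, depending on
# which factor level ends up in the numerator.
```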

3.a Computation of t test statistic

Since we verified the required assumptions, the test method is the independent (two-sample) t-test. In this case, we have:

  • a large sample \((n_1 + n_2 > 100)\)
  • the population variance(s) are unknown, but we can assume equal variances in the 2 groups
  • the \(standard\, error\) of the means’ difference is estimated from the two sample standard deviations, i.e. it estimates the standard deviation of the sampling distribution of the difference

So we can compute: \(t_{calc} = \frac{Difference\,Between\,Sample\,means}{Std.\,Err.\,of\,the\,difference} = \frac{\bar{x_1} -\bar{x_2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}\)

# Step 1 - compute difference of sample means
mean_diff <- (mean_died - mean_survived) # -10.27645 

# Step 2 - Compute associated t-statistics
# std error of the means' difference
pooled_stderror <- sqrt(sd_died^2/(n_died ) + sd_survived^2/(n_survived )) 
# std error with the small-sample correction (n - 1 in the denominators)
pooled_stderror_corr <- sqrt(sd_died^2/(n_died-1) + sd_survived^2/(n_survived-1)) 

###  t statistic  
t_calc <- (mean_died - mean_survived) / pooled_stderror_corr 

3.b Computation of the p-value associated to the t statistic

# Step 3 - degrees of freedom
# n1 + n2 - number of estimated parameters (2 means)
d_f <- n_died + n_survived - 1 - 1 # 297

# Step 4 - Deduced p-value (two-tailed; using -abs() makes the formula
# correct regardless of the sign of t_calc)
p_value <- 2 * pt(-abs(t_calc), df = d_f) # 0.4009635
p_value
[1] 0.4009635
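Steps 1–4 above are exactly what `stats::t.test` automates. A sketch, assuming the same `heart_failure` data; `var.equal = TRUE` requests the pooled Student t-test we just justified:

```r
# Built-in two-sample t-test, equal variances assumed
t.test(plat_norm ~ DEATH_EVENT_f, data = heart_failure,
       var.equal = TRUE, alternative = "two.sided")
# The t statistic, df = 297, p-value, and 95% CI of the difference
# should be close to the manual results above (p ~ 0.40).
```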

4. Results and interpretation

  1. Looking at the confidence interval of the difference: the 95% CI of the mean difference contains 0, so a zero difference between the population means is plausible
mean_diff
[1] -10.27645
# CI of the means difference 
CI_lower <- mean_diff + qt(.025, n_died + n_survived - 2) * pooled_stderror_corr  
CI_lower
[1] -34.32074
CI_upper <- mean_diff + qt(.975, n_died + n_survived - 2) * pooled_stderror_corr  
CI_upper
[1] 13.76785
  2. As for the p-value, p_value = 0.40 is larger than the threshold probability \(\alpha = 0.05\)
# Check 
p_value
[1] 0.4009635
p_value <  0.05  # FALSE 
[1] FALSE

DECISION: We fail to reject the null hypothesis of equal population means of TPC: the test indicates that we do not have sufficient evidence to say that the mean platelet counts of these two populations are different.

— EXAMPLE D —

(3+ samples | ANOVA test)

Comparing sample means from 3 or more groups (ANOVA)

In this example, we adopt the ANOVA (“Analysis Of Variance”) test, i.e. an extension of the previous test that examines how the means of a variable differ across 3 or more groups. We will use ‘one-way’ ANOVA, which applies when there is only one explanatory variable (“treatment”) with 3 or more levels, and only one level of treatment is applied to a given subject.

For this particular case, we use another realistic dataset showing the survival times of 33 laboratory mice with thymic leukemia who were randomly divided into 3 groups:

  • 1st group received Treatment 1
  • 2nd group received Treatment 2
  • 3rd group as Control
# load new dataset
mice <- readxl::read_excel(here::here("practice","data_input",
                                      "02_datasets","mice_exe_ANOVA.xlsx"))

1. Question: Is there a statistically significant difference between the mean values of the k populations?

1. Question: Is there sufficient evidence to confirm the belief that at least one of the two treatments affects the average survival time of mice with thymic leukemia?

More formally:

  • \(H_0\) : \(\mu_1 = \mu_2 = \mu_3\) all 3 population means are equal
  • \(H_a\) : at least one of \((\mu_1, \mu_2, \mu_3)\) is not equal to the other means
# boxplot by group
mice %>% 
ggplot(., aes(x = group, y = surv_days, fill = group)) +
  geom_boxplot() + 
  scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
  geom_jitter(color="black", size=0.4, alpha=0.9) +
  # theme_minimal() +
  # drop legend and Y-axis title
  theme(plot.title = element_text(size = 14,face="bold", color = "#873c4a"),
        axis.text.x = element_text(size=12,face="bold"), 
        axis.text.y = element_text(size=12,face="bold"),
        legend.position = "none",
        ) + 
  labs(title = "Visually check mean and variance in populations' samples" ) + 
  ylab(label = "Survival (# days)") + xlab(label = "")

1. Question: Is there a statistically significant difference between the mean values of the k populations?

The boxplot suggests that the 3 groups may have fairly different distributions

2. Verify the assumptions for one-way ANOVA

The dependent variable is on a metric scale. In the case of the analysis of variance, the independent variable (factor) has at least three levels.

Assumptions for the results of a one-way ANOVA to be valid:

  1. Independence of observations – The observations in each group are independent of each other and the observations within groups were obtained by a random sample. ✅
  2. Normally-distributed response variable – The values of the dependent variable follow a normal distribution. ❓
  3. Homogeneity of variance – The variances of the populations that the samples come from are equal. ❓

Preliminary check for normality (visual)

  1. Normally-distributed response variable ✅ (confirmed by visual inspection)

Preliminary check for normality (test)

  1. Normally-distributed response variable ✅ (confirmed by Shapiro-Wilk normality test) [The null hypothesis of this test is that “sample distribution is normal”]
# Shapiro-Wilk Normality Test to verify normality  
SW_t <- mice %>%
  group_by(group) %>%
  rstatix::shapiro_test(surv_days)

SW_t
# A tibble: 3 × 4
  group       variable  statistic     p
  <chr>       <chr>         <dbl> <dbl>
1 Control     surv_days     0.994 0.999
2 Treatment 1 surv_days     0.957 0.611
3 Treatment 2 surv_days     0.979 0.960

Preliminary check variance equality

  1. Homogeneity of variance ✅ (Besides visual inspection, confirmed by the Levene test for variance equality)

[The null hypothesis H0 is that the groups have the same variance, i.e. any observed variance differences arise only by chance from sampling variability]

# Levene test for variance equality
levene <- mice %>%                               # name of the data
  car::leveneTest(surv_days ~ as.factor(group),   # continuous DV ~  group IV
                  data = .,            # pipe the data from above
                  center = mean)       # default is median 
levene
Levene's Test for Homogeneity of Variance (center = mean)
      Df F value Pr(>F)
group  2  0.1721 0.8427
      30               

No evidence of a violation of HOV was found: the p-value of the Levene test (0.8427) is greater than .05, so the variances are not significantly different from each other (i.e., the homogeneity-of-variance assumption is met).
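The same check can be written pipe-friendly with `rstatix::levene_test` (loaded at the top). A sketch; note that its default centers on the median (the Brown–Forsythe variant), so we pass `center = mean` to match the `car::leveneTest` call above:

```r
# Pipe-friendly Levene test via {rstatix}
mice %>%
  rstatix::levene_test(surv_days ~ as.factor(group), center = mean)
# returns a tibble with df1, df2, the statistic and the p-value
```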

3. Computation of ANOVA F-ratio

ANOVA in R can be done in several ways.

Since it’s quite straightforward, let’s do all the steps by hand first. We need to obtain the needed “ingredients” to calculate the F-ratio:

\[F_{calc}=\frac{Mean\,Square\,Between}{Mean\,Square\,Within}= \frac{MSB}{MSW} = \frac{\frac{SSB}{df1}}{\frac{SSW}{df2}} \]

3.a Computation of ANOVA F-ratio (manually)

  • Option 1: Let’s compute the ANOVA test “manually”
# Summary statistics
mice_calc <- mice %>% 
  mutate(mean_all = mean(surv_days),
         sd_all = sd (surv_days),
         dfw = 33-3, # df2 = n - k (within)
         dfb = 3-1, # df1 = k - 1 (between)
         group_f = as.factor(group)
         ) %>% 
  group_by(group) %>% 
  mutate(n_group = n(),
         mean_group = mean(surv_days),
         sd_group = sd (surv_days)) %>% 
  ungroup() %>% 
  mutate (ST = (surv_days - mean_all)^2,
          SW = (surv_days - mean_group)^2,
          SB = (mean_group - mean_all)^2)

# Sum of Squares 
SST <- sum(mice_calc$ST)
SSB <- sum(mice_calc$SB)
SSW <- sum(mice_calc$SW)
dfw <- 33-3 # df2 = n - k (within)
dfb <- 3-1 # df1 = k - 1 (between)

# calculated F statistic 
F_calc <- (SSB/dfb)/(SSW/dfw) # 5.65
# F critical value (here for alpha = 0.01)
F_crit <- qf(p = 0.01, df1 = 2, df2 = 30, lower.tail = FALSE) # 5.390346
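The manual route can be completed by converting the F statistic into a p-value: the right-tail area of the F(df1, df2) distribution beyond the observed value. Using the values printed above:

```r
# Right-tail p-value for the observed F statistic (df1 = 2, df2 = 30)
p_F <- pf(5.6522, df1 = 2, df2 = 30, lower.tail = FALSE)
p_F  # ~0.00826, the same Pr(>F) reported by stats::aov
```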

3.b Computation of ANOVA F-ratio (with R functions)

That was just to show how to build it step-by-step (🤓) but we don’t have to! We have alternative R functions that can do ANOVA for us:

  • Option 2: With the stats::aov followed by the command summary
aov_1 <- stats::aov(surv_days ~ group_f,
                 data = mice_calc)
summary(aov_1) 
            Df Sum Sq Mean Sq F value  Pr(>F)   
group_f      2  434.6  217.32   5.652 0.00826 **
Residuals   30 1153.4   38.45                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Option 3: With stats::oneway.test (which can also handle unequal variances; here var.equal = TRUE)
aov_2 <- stats::oneway.test(surv_days ~ group_f,
            data = mice_calc,
            # assuming equal variances
            var.equal = TRUE)
aov_2

    One-way analysis of means

data:  surv_days and group_f
F = 5.6522, num df = 2, denom df = 30, p-value = 0.008258

4. Results and interpretation

All 3 options have given the same result, i.e., F-ratio = 5.652 and p-value = 0.00826

DECISION: Given that the p-value is smaller than 0.05, we reject the null hypothesis that all means are equal. Therefore, we can conclude that at least one group differs from the others in mean number of survival days.
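The ANOVA F test only establishes that at least one group mean differs; it does not say which. A common follow-up is Tukey’s HSD on the fitted `aov_1` object from above (a sketch; `stats::TukeyHSD` adjusts the pairwise p-values for multiple comparisons):

```r
# Post-hoc pairwise comparisons after a significant ANOVA
TukeyHSD(aov_1, conf.level = 0.95)
# each row is a pairwise difference of group means, with a
# simultaneous 95% CI and a multiplicity-adjusted p-value
```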

Note

Have you seen the kind of notation Pr(>F) 0.00826 ** before (as in the output of the stats::aov function)?

A CLOSER LOOK AT TESTING ASSUMPTIONS

Testing two groups that are not independent

— ???? EXAMPLE ???? —

Testing if the data are not normally distributed: non-parametric tests

1. Question: …….?

2.a Computation of the test statistic

2.b Computation of the p-value associated to the test statistic

3. Results and interpretation

— EXAMPLE E —

(2 samples no normal | Wilcoxon Rank Sum Test)

Testing samples without normality assumption

1. Question: …….?

2.a Computation of the test statistic

2.b Computation of the p-value associated to the test statistic

3. Results and interpretation

— EXAMPLE F —

(2 samples no HOV | t test with the Welch correction )

Testing samples without homogeneous variance of observations assumption

1. Question: …….?

2.a Computation of the test statistic

2.b Computation of the p-value associated to the test statistic

3. Results and interpretation

Final thoughts/recommendations

  • Always read the documentation (?package::function, especially the examples at the bottom)

  • Always inspect the data / variables before and after making changes

  • Better to rename (i.e. create a new R object) when you recode/manipulate a column or a dataset

    • (this promotes reproducibility, because you will be able to retrace your coding steps)
  • Always plot distributions for visual data exploration

  • Make changes in small increments (like we saw in building ggplot2 graph in subsequent layers)

References

Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20(1), 16. dataset. https://doi.org/10.1186/s12911-020-1023-5
Wongsaengsak, S., Dennis, J., Arevalo, M., Ball, S., & Nugent, K. (2019). Significance of platelet counts in health and disease: Insights from a population study using data from the National Health and Nutrition Examination Survey. The Southwest Respiratory and Critical Care Chronicles, 7(30), 4–11. dataset. https://doi.org/10.12746/swrccc.v7i30.558